Classification of Scientific Papers Using Machine Learning Minh

Author

  • Mengjie Zhang
Abstract

The project aims to develop a domain-independent and adaptive approach to scientific document classification that uses both document contents and citation links. We evaluate several content-based classification methods, including K-nearest neighbours, nearest centroid, naive Bayes and decision trees, and find that naive Bayes outperforms the others when the training set is sufficiently large. Using phrases in addition to words, together with a good feature selection strategy such as information gain, is found to improve system accuracy compared with using words only. To bring citation links into classification, the project proposes two methods: linear labelling update and probabilistic labelling update. Both methods iteratively update the labellings of classified documents using category information from neighbouring documents. Our experiments on the two methods show that combining contents and citations significantly improves system performance.
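
The iterative idea described above can be illustrated with a minimal Python sketch. The abstract does not give the actual update formula, so everything here is an assumption: the function names `linear_label_update` and `predict`, the blending weight `alpha`, the fixed iteration count, and the toy scores are hypothetical. The sketch blends each document's content-based category scores (for example, naive Bayes posteriors) with the average scores of the documents it is linked to by citation, which is one plausible reading of a "linear" update and not necessarily the project's formulation.

```python
def linear_label_update(content_scores, links, alpha=0.6, iterations=10):
    """Blend content-based scores with the mean scores of linked documents.

    content_scores: dict doc_id -> dict category -> score (rows sum to 1)
    links: dict doc_id -> list of neighbouring doc_ids (citation links)
    alpha: weight on the content evidence (hypothetical parameter)
    """
    # Start from the content-only labelling.
    current = {doc: dict(scores) for doc, scores in content_scores.items()}
    for _ in range(iterations):
        updated = {}
        for doc, own in content_scores.items():
            neighbours = [n for n in links.get(doc, []) if n in current]
            if not neighbours:
                # No citation evidence: keep the content-based scores.
                updated[doc] = dict(own)
                continue
            updated[doc] = {
                cat: alpha * own[cat]
                + (1 - alpha)
                * sum(current[n][cat] for n in neighbours) / len(neighbours)
                for cat in own
            }
        current = updated
    return current


def predict(scores):
    """Pick the highest-scoring category for every document."""
    return {doc: max(cats, key=cats.get) for doc, cats in scores.items()}


# Toy data (invented for illustration): document "c" is borderline on
# content alone, but both documents it is linked to are clearly "ml".
content = {
    "a": {"ml": 0.9, "db": 0.1},
    "b": {"ml": 0.8, "db": 0.2},
    "c": {"ml": 0.45, "db": 0.55},
}
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

before = predict(content)
after = predict(linear_label_update(content, links))
```

In this toy setup the content-only labelling assigns "c" to "db", while the blended labelling moves it to "ml" because its neighbours agree. Since each row of scores sums to 1 and the update is a convex combination, the updated scores still sum to 1 per document.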


Similar resources

A classification of abduction: abduction for logic programming

Abduction is a methodology of scientific researches. Peirce showed three types of abduction, and expressed them by one syllogism. Recently various researches on abduction or abductive logic have been developed in the fields of automated reasoning and machine learning. In order to systematically understand such researches and to clearly discuss abduction, this paper classifies abduction into five type...


The Case against Accuracy Estimation for Comparing Induction Algorithms

We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and standard benchmark data sets. The results raise serious concerns about the use of accuracy for comparing classifiers and draw into question the conclusions that can be drawn from such studies. In the c...


A Hybrid Approach for Sentiment Analysis Applied to Paper Reviews

This article discusses the problem of extracting sentiment and opinions about a collection of articles on scientific reviews conducted under an international conference on computing in Spanish language. The aim of this analysis is on the one hand to automatically determine the orientation of a review of an article and contrast this approach with the assessment made by the reviewer of the article. ...


A Machine Learning Approach to Building Domain-Specific Search Engines

Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new...


Modeling Classification and Inference Learning

Human categorization research is dominated by work in classification learning. The field may be in danger of equating the classification learning paradigm with the more general phenomenon of category learning. This paper compares classification and inference learning and finds that different patterns of behavior emerge depending on which learning mode is engaged. Inference learning tends to focus subj...



Publication date: 2005